Building Useful Models from Imbalanced Data with Sampling and Boosting

نویسندگان

  • Chris Seiffert
  • Taghi M. Khoshgoftaar
  • Jason Van Hulse
  • Amri Napolitano
چکیده

Building useful classification models can be a challenging endeavor, especially when training data is imbalanced. Class imbalance presents a problem when traditional classification algorithms are applied. These algorithms often attempt to build models with the goal of maximizing overall classification accuracy. While such a model may be very accurate, it is often not very useful. Consider the domain of software quality prediction where the goal is to identify program modules that are most likely to contain faults. Since these modules make up only a small fraction of the entire project, a highly accurate model may be generated by classifying all examples as not fault prone. Such a model would be useless. To alleviate the problems associated with class imbalance, several techniques have been proposed. We examine two such techniques: data sampling and boosting. Five data sampling techniques and one commonly used boosting algorithm are applied to five datasets from the software quality prediction domain. Our results suggest that while data sampling can be very effective at improving classification performance when training data is imbalance, boosting (which has received considerably less attention in research related to mining imbalanced data) usually results in even better perfor-

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...

متن کامل

Combining Feature Selection and Ensemble Learning for Software Quality Estimation

High dimensionality is a major problem that affects the quality of training datasets and therefore classification models. Feature selection is frequently used to deal with this problem. The goal of feature selection is to choose the most relevant and important attributes from the raw dataset. Another major challenge to building effective classification models from binary datasets is class imbal...

متن کامل

CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Class imbalance classification is a challenging research problem in data mining and machine learning, as most of the real-life datasets are often imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, the minority class instances are representing the concept with greater in...

متن کامل

Comparing Two Approaches for Adding Feature Ranking to Sampled Ensemble Learning for Software Quality Estimation

High dimensionality and class imbalance are two main problems that affect the quality of training datasets in software defect prediction, resulting in inefficient classification models. Feature selection and data sampling are often used to overcome these problems. Feature selection is a process of choosing the most important attributes from the original data set. Data sampling alters the data s...

متن کامل

Amazon Employee Access Control System

In this work, based on the history data of 20102011 from Amazon Inc., we build up a system which aims to take place of resource administrators at Amazon. Our analysis shows that the given dataset is highly imbalanced with categorical values. Thus in the preprocessing step, we tried different sampling methods, feature selection as well as one hot encoding to make the data more suitable for predi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008